JAMIA Open

35 training papers 2019-06-25 – 2026-03-07

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
Automated identification of unstandardized medication data: A scalable and flexible data standardization pipeline using RxNorm on GEMINI multicenter hospital data
2022-02-21 health informatics 10.1101/2022.02.16.22268694
#1 (19.3%)

Objective: Patient data repositories often assemble medication data from multiple sources, necessitating standardization prior to analysis. We implemented and evaluated a medication standardization procedure for use with a wide range of pharmacy data inputs across all drug categories, which supports research queries at multiple levels of granularity. Methods: The GEMINI-RxNorm system automates the use of multiple RxNorm tools in tandem with other datasets to identify drug concepts from pharmacy ord...

2
LLM-based data extraction for a large cancer registry, the Ontario Hereditary Cancer Research Network
2025-08-26 health informatics 10.1101/2025.08.20.25334127
#1 (18.4%)

Importance: Manual data extraction from genomic lab reports for online registries and databases is time-consuming for human resources such as clinical research coordinators. Automated tools, especially LLMs, can address these issues. Efficient and accurate data processing is crucial for building a reliable database. Objective: To streamline the data extraction and curation process for genetic testing lab reports using an LLM-based approach. Design: Nine sample molecular lab reports were selected fo...

3
Infectious, Allergic, and Immune-Mediated Disease Data Resources: A Landscape Overview and Subset Assessment
2025-07-30 health informatics 10.1101/2025.07.30.25332458
#1 (17.9%)

Background: The Data Management and Sharing (DMS) Policy issued by the National Institutes of Health (NIH) requires most grant applications to include a DMS Plan, detailing data type(s), resources (e.g., data repositories, knowledgebases, portals) for data sharing, and a dissemination timeline. Researchers face challenges navigating the complex data landscape to identify data resources to fulfill the DMS Policy requirements. The National Institute of Allergy and Infectious Diseases (NIAID) aims to...

4
The Fault in Our Sets: A Mixed Methods Analysis of Clinical Value Set Errors
2025-03-01 health informatics 10.1101/2025.02.27.25323054
#1 (14.9%)

Objective: To characterize clinical value set issues and identify common patterns of errors. Materials and Methods: We conducted semi-structured interviews with 26 value set experts and performed root cause analyses of errors identified in electronic health records (EHRs). We also analyzed a random sample of user-reported issues from the Value Set Authority Center (VSAC), developing a categorization scheme for value set errors. Additionally, we audited medication value sets from three sources and a...

5
Development and validation of a generative AI-assisted medication-indication knowledge base
2026-01-06 health informatics 10.64898/2026.01.06.26343341
#1 (14.7%)

Background: Existing information resources about medicines and their indications have limited usefulness for health data analytics. The emerging potential of large language models (LLMs) to generate clinically accurate responses presents a novel opportunity to develop a comprehensive knowledge base of medicines and their clinical indications. Method: Unique medications from the English Prescribing Dataset (EPD) were extracted and included in a fine-tuned prompt pipeline using the GPT-4 and MedCAT L...

6
Phenotype Execution and Modelling Architecture (PhEMA) to support disease surveillance and real-world evidence studies: English sentinel network evaluation.
2023-11-22 health informatics 10.1101/2023.11.21.23298758
#1 (14.6%)

Objective: To evaluate Phenotype Execution and Modelling Architecture (PhEMA), to express sharable phenotypes using Clinical Query Language (CQL) and intensional SNOMED CT Fast Healthcare Interoperability Resources (FHIR) valuesets, for exemplar chronic disease, sociodemographic risk factor and surveillance phenotypes. Method: We curated three phenotypes: Type 2 diabetes (T2DM), excessive alcohol use and incident influenza-like illness (ILI) using CQL to define clinical and administrative logic. We...

7
Rx Norm for Europe - Toward the representation of medicinal products in the OMOP CDM: Graph visualization and validation of two mapping approaches using the OHDSI USAGI tool and LLM
2026-01-17 health informatics 10.64898/2026.01.15.26344216
#1 (14.5%)

Medication product names in Swiss electronic health records are heterogeneous and often encode multiple attributes (e.g., ingredient, strength, dose form, packaging) in German free text. This limits interoperability and reduces the utility of ATC codes, which do not uniquely identify products. We compared two workflows for mapping Swiss medication products to RxNorm and RxNorm Extension: (i) an Observational Health Data Sciences and Informatics (OHDSI) USAGI workflow with lexical similarity and ...

8
medExtractR: A medication extraction algorithm for electronic health records using the R programming language
2019-09-23 health informatics 10.1101/19007286
#1 (14.5%)

Objective: We developed medExtractR, a natural language processing system to extract medication dose and timing information from clinical notes. Our system facilitates creation of medication-specific research datasets from electronic health records. Materials and Methods: Written using the R programming language, medExtractR combines lexicon dictionaries and regular expression patterns to identify relevant medication information (drug entities). The system is designed to extract particular medicat...

9
Knowledge Driven Phenotyping
2019-12-06 health informatics 10.1101/19013748
#1 (14.4%)

Extracting patient phenotypes from routinely collected health data (such as Electronic Health Records) requires translating clinically-sound phenotype definitions into queries/computations executable on the underlying data sources by clinical researchers. This requires significant knowledge and skills to deal with heterogeneous and often imperfect data. Translations are time-consuming, error-prone and, most importantly, hard to share and reproduce across different settings. This paper proposes a...

10
Sharing and Reusing Computable Phenotype Definitions
2023-09-18 health informatics 10.1101/2023.09.17.23295681
#1 (14.3%)

Background: A scalable approach for the sharing and reuse of human-readable and computer-executable phenotype definitions can facilitate the reuse of electronic health records for cohort identification and research studies. Description: We developed a tool called Sharephe for the Informatics for Integrating Biology and the Bedside (i2b2) platform. Sharephe consists of a plugin for i2b2 and a cloud-based searchable repository of computable phenotypes, has the functionality to import to and export fr...

11
Accelerating precision medicine: a proposed framework for large-scale multiomics data integrity, interoperability, analysis, and collaboration in biomedical discovery
2024-03-18 health informatics 10.1101/2024.03.15.24304358
#1 (14.3%)

Objective: To identify and define a process and framework for biomedical discovery research. Our study aim was to characterize the biomedical discovery lifecycle across data modalities and professional stakeholders involved in biomedical research to address the multiomics data challenges of precision medicine. Materials and Methods: We recruited fifteen professionals from various biomedical roles and industries to participate in 60-minute semi-structured interviews, which involved an assessment of ...

12
From Spreadsheets and Bespoke Models to Enterprise Data Warehouses: GPT-enabled Clinical Data Ingestion into i2b2
2025-04-19 health informatics 10.1101/2025.04.17.25325962
#1 (14.0%)

Objective: Clinical and phenotypic data available to researchers are often found in spreadsheets or bespoke data models. Bridging these to enterprise data warehouses would enable sophisticated analytics and cohort discovery for users of platforms like NHGRI's Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL). We combine data mapping methodologies, biomedical ontologies, and large language models (LLMs) to load these data into Inf...

13
Extracting social determinants of health from electronic health records: development and comparison of rule-based and large language models-based methods
2025-11-17 health informatics 10.1101/2025.11.15.25339520
#1 (12.3%)

Objectives: Social determinants of health (SDoH) are critical drivers of health outcomes but are often under-documented in structured electronic health record data. This study aimed to develop and evaluate scalable methods for extracting seven SDoH domain categories and 23 subcategories from unstructured clinical notes using both rule-based and large language model (LLM)-based approaches. Methods: We constructed a gold-standard SDoH corpus comprising clinical text segments from 171 patients in the ...

14
Development and Validation of Natural Language Processing Algorithms in the ENACT National Electronic Health Record Research Network
2025-01-27 health informatics 10.1101/2025.01.24.25321096
#1 (12.2%)

Electronic health record (EHR) data are a rich and invaluable source of real-world clinical information, enabling detailed insights into patient populations, treatment outcomes, and healthcare practices. The availability of large volumes of EHR data is critical for advancing translational research and developing innovative technologies such as artificial intelligence. The Evolve to Next-Gen Accrual to Clinical Trials (ENACT) network, established in 2015 with funding from the National Center for...

15
Evaluation of Patient-Level Retrieval from Electronic Health Record Data for a Cohort Discovery Task
2019-08-25 health informatics 10.1101/19005280
#1 (12.1%)

Objective: Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well understood. The objective of this research was to assess patient-level information retrieval (IR) methods using electronic health records (EHR) for different types of cohort definition retrieval. Materials and Methods: We developed a test collection consisting of about 100,000 patient records and 56 test topics that characteri...

16
A Comprehensive Approach to Days' Supply Estimation in a Real-World Prescription Database: Data Cleaning, Imputation, and Adherence Analysis
2025-09-05 health informatics 10.1101/2025.09.03.25335007
#1 (12.0%)

Background: Accurate medication usage statistics and medication adherence calculations require an accurate days' supply (DS) for each prescription. Unfortunately, the DS, or the information needed to calculate it, is often not provided. Other methods must therefore be applied to fill in missing values or to replace incorrect ones. Objective: The aim of this study is to apply a variety of methods for managing incomplete and missing data to enhance the accuracy of calculating DS ...

17
Visualizing Geospatial and Temporal Phenotype Prevalences
2024-11-04 health informatics 10.1101/2024.11.01.24316603
#1 (12.0%)

High-throughput phenotyping strategies are capable of classifying large volumes of patients. However, translating this data to real world applications is challenging. We have developed GeoPheno, a tool which displays the geospatial prevalences of EHR-based phenotypes in the Veteran population over time. Our flexible tool can display data from a wide array of phenotypes and is integrated with the CIPHER phenotype library, allowing users to view the definitions of the conditions being visualized.

18
A Query Taxonomy Describes Performance of Patient-Level Retrieval from Electronic Health Record Data
2019-11-15 health informatics 10.1101/19012294
#1 (12.0%)

Performance of systems used for patient cohort identification with electronic health record (EHR) data is not well-characterized. The objective of this research was to evaluate factors that might affect information retrieval (IR) methods and to investigate the interplay between commonly used IR approaches and the characteristics of the cohort definition structure. We used an IR test collection containing 56 test patient cohort definitions, 100,000 patient records originating from an academic me...

19
Cohort Identification Using Semantic Web Technologies: Triplestores as Engines for Complex Computable Phenotyping
2021-12-05 health informatics 10.1101/2021.12.02.21267186
#1 (11.9%)

Background: Computable phenotypes are increasingly important tools for patient cohort identification. As part of a study of risk of chronic opioid use after surgery, we used a Resource Description Framework (RDF) triplestore as our computable phenotyping platform, hypothesizing that the unique affordances of triplestores may aid in making complex computable phenotypes more interoperable and reproducible than traditional relational database queries. To identify and model risk for new chronic opioi...

20
A systematic review of ontology-based clinical decision support system rules: usage, management, and interoperability
2022-05-16 health informatics 10.1101/2022.05.11.22274984
#1 (11.9%)

Objective: Clinical decision support systems (CDSS) have a critical role in improving the quality and safety of health care delivery. CDSS rules direct the behavior of CDSS. However, CDSS rules have not been routinely shared and reused, and ontologies can promote their reuse. We systematically screened the literature to describe the current status of ontologies applied in CDSS rule management. Methods: We searched PubMed, the Association for Computing Machinery (ACM) Digital Library, ...